1. Introduction

This is my attempt at the Kaggle Digit Recognizer competition:

https://www.kaggle.com/c/digit-recognizer

The Kaggle data comprises approx. 40,000 entries across 785 columns (the label plus one column per pixel position), with pixel values ranging from 0 to 255. The entries are evenly distributed across the 10 numbers/outcomes (0-9).

The data has already been processed in Python as follows:

  1. all columns with constant values have been removed (fewer than 200 columns)

  2. the data was grouped by number/outcome and all columns with average pixel values <= 20 were deleted, which cuts the total number of columns down to:

## [1] 434
  3. a new dataframe has been created with three fields: number/outcome, column name and average pixel value

This should help me visualise if any further columns should be removed or whether or not I can create any composite features.
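A minimal Python sketch of those three preprocessing steps, assuming a pandas workflow and using a tiny synthetic stand-in for the Kaggle file. The "<= 20" rule is interpreted here as "no digit's average for that column exceeds 20" — that interpretation, and all column names and values below, are my assumptions:

```python
import pandas as pd

# Tiny synthetic stand-in for the Kaggle train.csv (label + pixel columns)
df = pd.DataFrame({
    "label":  [0, 0, 1, 1],
    "pixel0": [0, 0, 0, 0],        # constant -> dropped in step 1
    "pixel1": [5, 10, 5, 10],      # low average everywhere -> dropped in step 2
    "pixel2": [200, 180, 30, 60],  # kept
})

# 1. Drop pixel columns that are constant across all rows
constant = [c for c in df.columns if c != "label" and df[c].nunique() == 1]
df = df.drop(columns=constant)

# 2. Group by number and drop columns whose average pixel value is <= 20
#    for every number (my reading of the step described above)
means = df.drop(columns="label").groupby(df["label"]).mean()
low = [c for c in means.columns if means[c].max() <= 20]
df = df.drop(columns=low)

# 3. Long-format dataframe: number / column name / average pixel value
long_df = means.drop(columns=low).reset_index().melt(
    id_vars="label", var_name="column_name", value_name="avg_pixel")
```

On the real data, the same pipeline would read `train.csv` instead of the hand-built frame and should leave 434 columns.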

2. Pixel distribution

I then drew a few graphs to get a handle on the data.

  1. Histograms: column counts by number

  2. Histograms: average pixel value counts by number

  3. Column name v average pixel value by number (jittered so you can see density)

  4. Column name v average pixel value - all numbers
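As a rough illustration, the second of those plots (average pixel value histograms by number) could be drawn in Python with matplotlib. The long-format data below is made up purely for the sketch; the real values would come from the dataframe built during preprocessing:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up long-format data standing in for the real dataframe:
# one row per (number, column name) with its average pixel value
rng = np.random.default_rng(1)
long_df = pd.DataFrame({
    "label": np.repeat(np.arange(10), 40),
    "avg_pixel": rng.uniform(20, 255, size=400),
})

# One histogram panel per digit, sharing axes for easy comparison
fig, axes = plt.subplots(2, 5, figsize=(12, 5), sharex=True, sharey=True)
for ax, (digit, grp) in zip(axes.ravel(), long_df.groupby("label")):
    ax.hist(grp["avg_pixel"], bins=12)
    ax.set_title(str(digit))
fig.savefig("avg_pixel_hist.png")
```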

There seem to be a few patterns here:

0 - more than 40 avg_pixels in the 150-175 range

1 - only number with avg_pixels over 225

6 - only number with avg_pixels under 20?

3. More Questions

  1. Are the datapoints distributed fairly evenly amongst the column names? I tested this by cutting the column_names into twelve bins and looking at the distribution for each number. Here is an example for zero.

Answer: Yes, they seem to be relatively well distributed for each number. Dead end here?
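The binning check behind that answer can be sketched with pandas' `cut`. The positions below are hypothetical stand-ins for the zero digit's retained column names (the real ones come from the 434 remaining columns):

```python
import pandas as pd

# Hypothetical column positions for one digit's retained columns
positions = pd.Series([12, 98, 150, 230, 310, 405, 466, 520, 601, 640, 702, 760])

# Cut the positions into twelve equal-width bins and count per bin;
# a roughly flat count list means the columns are evenly spread
bins = pd.cut(positions, bins=12)
counts = bins.value_counts(sort=False)
print(counts.tolist())
```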

  2. Does ggpairs have anything to add?

The box plots look interesting.

  1. Column name seems to have a consistent mean across the numbers, and the majority of the data falls within the 250-575 range.

  2. Average pixel means seem a bit more varied, and the majority falls within the 25-155 range.

  3. This led me on to thinking about pixel strength, and whether the higher/stronger pixel range would be more distinctive than the lower.

There is a much clearer distinction between the different numbers/outcomes above 100 pixels, even more so above 150.

Does this distinction apply across column names as well?

Yes…

  3. I then checked the distribution for column names outside the majority 250-575 range.

The distribution is less even across the numbers, particularly under 200 and over 700. This could help the algorithm. The problem is that many of these columns have average pixel values under 150, which I was considering removing…

4. Conclusion

I’ll move to the Machine Learning stage with five different datasets:

  1. all columns with constant values and avg pixel values < 20 removed - TOTAL COLUMNS: 434

  2. as above, but all columns with avg pixel values < 100 removed - TOTAL COLUMNS: 271

  3. as no. 1, but all columns with avg pixel values < 150 removed - TOTAL COLUMNS: 191

  4. as no. 2, but including columns with names < 200 and > 700 - TOTAL COLUMNS: 331

  5. as no. 3, but including columns with names < 200 and > 700 - TOTAL COLUMNS: 263
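Selecting those five column sets can be sketched in Python as threshold filters plus a union with the outlying positions. The per-column averages and positions below are hypothetical stand-ins for the 434 columns that survive preprocessing:

```python
import pandas as pd

# Hypothetical per-column average pixel values and pixel positions
avg_pixel = pd.Series(
    [15, 80, 120, 160, 210],
    index=["pixel10", "pixel180", "pixel320", "pixel650", "pixel720"])
position = pd.Series(
    [10, 180, 320, 650, 720], index=avg_pixel.index)

ds1 = avg_pixel.index[avg_pixel >= 20]    # dataset 1: avg pixel < 20 removed
ds2 = avg_pixel.index[avg_pixel >= 100]   # dataset 2: < 100 removed
ds3 = avg_pixel.index[avg_pixel >= 150]   # dataset 3: < 150 removed

# Datasets 4 and 5 add back the outlying column positions (< 200 or > 700)
outlying = avg_pixel.index[(position < 200) | (position > 700)]
ds4 = ds2.union(outlying)
ds5 = ds3.union(outlying)
```

Each resulting index would then be used to subset the training frame before fitting.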